ABSTRACT 

This paper presents a statistically sound method for using 
likelihood to assess potential models of network evolution. 
The method is tested on evolution data from five real networks: 
two views of the Internet autonomous system network, two 
photo sharing sites and a co-authorship network. 
1. INTRODUCTION 

It has been found that networks arising in very differ- 
ent contexts share some structural statistical properties, for 
example a power law in their degree distribution. Such net- 
works include the Internet Autonomous System (AS) topol- 
ogy, the WWW hyperlink graph, co-authorship networks, 
sexual contact networks, social networks based on email ex- 
change, biological networks and others. For examples and 
references see [31 table 3.1] 

One common hypothesis for the basis of these shared char- 
acteristics is the presence of common elementary network 
development processes, such as the preferential attachment 
model of Barabasi et al. [2]. Other models have been pro- 
posed for the evolution of specific classes of networks. Many 
authors have proposed models which attempt to explain the 
evolution of a target network in terms of simple rules which 
produce artificial networks with the same characteristics as 
a given target network. Examples of models of this kind 
can be found in [1], [2], [3], [11]. In the literature, such models 
are usually tested by growing an artificial model of the same 
size as the target network and comparing several network 
statistics on the real and artificial networks. 

In this paper we propose the Framework for Evolutionary 
Topology Analysis (FETA). This framework provides several 
advantages when compared to the usual method of testing 
models: a single likelihood based measure of how well a 
proposed model explains the observed network evolution, it 
uses network evolution data rather than a single static net- 
work snapshot, and it includes a method for creating new 
network models from linear combinations of sub-models and 
a method for optimising the mixture of these sub-models. 
The FETA allows the assessment of network models with- 
out growing artificial models and comparing them to the 
target network, making model testing much faster. This pa- 
per is a companion to [5] which introduced the framework 
and showed it could recover known model parameters for 
artificial network models. This paper shows that the FETA 
framework can be used to investigate a variety of real net- 
works. The class of models which FETA can work with 
includes Barabasi-Albert (BA) [2], Albert-Barabasi (AB) 
[1], Generalised Linear Preference (GLP) [3] and Positive 
Feedback Preference (PFP) [11]. 

2. A LIKELIHOOD BASED FRAMEWORK 
FOR ASSESSING NETWORK MODELS 

The probabilistic models used by FETA are described in 
terms of two components referred to as an inner model and 
an outer model. 

Definition 1. The outer model chooses the operation that 
transforms the current network. This could be to add a node, 
add a link between existing nodes, delete a node or delete a 
link between nodes. 

Definition 2. The inner model chooses the entity on which 
the operation will act. More simply it defines a probabilistic 
model which gives the probabilities of choosing nodes or links 
for the add or remove operation selected by the outer model. 

A simple example would be the AB model. This would 
correspond to an outer model which adds a new node and 
then chooses exactly three inner nodes to connect to it. The 
inner model assigns probabilities to each inner node exactly 
proportional to their node degree. As is common in the 
literature, the main focus of FETA is on the inner model. 
The framework is flexible enough to allow or disallow node 
and edge removal, non-simple and directed graphs. For this 
paper, however, only connected, simple, undirected graphs 
which never lose nodes or edges are considered. 

Let $G_0$ be the known state of the graph at a certain time. 
Assume that the graph is known for each time an edge is added 
up to some step $t$ (that is, $G_0, G_1, \ldots, G_t$ are known). Let $\theta$ be a 
proposed model which attempts to explain this evolution. 
The model $\theta$ assigns probabilities to entities in the network 
at each step of the network evolution. 

In order to simplify the explanation, assume that the outer 
model always involves the choice of a single existing node to 
connect to a new node. Let $C = (N_1, \ldots, N_t)$ be the ordered 
list of nodes selected at each step, derived from $G_0, \ldots, G_t$. 
Let $p_i(j|\theta)$ be the probability that inner model $\theta$ assigns to 
node $j$ at step $i$, that is, the probability that node $j$ is 
chosen at step $i$. To be a valid model $\theta$ should ensure that 
$\sum_j p_i(j|\theta) = 1$ where the sum is over nodes. The following 
theorem can easily be shown [5]. 

Theorem 1. Let $C = (N_1, \ldots, N_t)$ be the observed node 
choices at steps $1, \ldots, t$ of the evolution of the graph $G$. Let 
$\theta$ be some hypothesised valid inner model which assigns a 
probability $p_j(i|\theta)$ to node $i$ at step $j$. The likelihood of the 
observed $C$ given $\theta$ is 

$$L(C|\theta) = \prod_{j=1}^{t} p_j(N_j|\theta).$$ 

Note that the probability $p_j(i|\theta)$ may depend on many 
things including past history, node properties exogenous to 
the graph and previous node choices. As long as these are 
observable then the calculation is still easy to make. The 
next step is to define a null model $\theta_0$ to compare the 
hypothesised model $\theta$ against. 

Definition 3. The null model $\theta_0$ is defined as the model 
which gives every node in the choice set equal probability 
(this can also be thought of as the random model). The per 
choice likelihood ratio $c_0$ is the likelihood ratio between $\theta$ and 
the null model $\theta_0$ normalised by the number of choices $t$: 

$$c_0 = \left[ \frac{L(C|\theta)}{L(C|\theta_0)} \right]^{1/t}.$$ 

The quantity $c_0$ is one if $\theta$ is exactly as likely as $\theta_0$ to 
have given rise to the observed choices $C$. If $c_0$ is greater 
than one then $\theta$ is more likely and if less than one it is less 
likely. Note that two hypothesised models can be compared 
by looking at the ratio of their $c_0$ values. Note also that 
$c_0$ and $L(C|\theta)$ are simply different ways of looking at the 
model likelihood. It is also worth noting that, using the 
same underlying source code to calculate the probabilities, 
generating an artificial network model of the same size as 
the real target network took much longer (sometimes a hundred 
times as long) than measuring the likelihood statistics. 
Other standard statistics such as deviance and Akaike's 
Information Criterion can also trivially be calculated from 
$L(C|\theta)$. 
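As a minimal sketch of the calculation described above (it is not the authors' FETA code), the per choice likelihood ratio can be computed in log space from two observable lists: the probability the model gave to the node actually chosen at each step, and the size of the choice set at each step. The function names and list-based inputs are assumptions made for illustration, and the exponent $1/t$ follows the "normalised by the number of choices" wording of Definition 3. AIC uses its standard definition $2k - 2 \ln L$.

```python
import math


def log_likelihood(chosen_probs):
    """Sum of log-probabilities the model gave to each observed choice."""
    total = 0.0
    for p in chosen_probs:
        if p <= 0.0:
            # The model ruled out something that was observed: L = 0.
            return float("-inf")
        total += math.log(p)
    return total


def per_choice_ratio(chosen_probs, choice_set_sizes):
    """c_0 = (L(C|theta) / L(C|theta_0)) ** (1/t), computed in log space.

    choice_set_sizes[i] is the number of nodes available at step i, so the
    null model gives the chosen node probability 1 / choice_set_sizes[i].
    """
    t = len(chosen_probs)
    if t == 0 or t != len(choice_set_sizes):
        raise ValueError("need one choice-set size per observed choice")
    log_l = log_likelihood(chosen_probs)
    log_l0 = -sum(math.log(n) for n in choice_set_sizes)
    return math.exp((log_l - log_l0) / t)


def aic(chosen_probs, n_params):
    """Akaike's Information Criterion, 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood(chosen_probs)
```

Working with sums of logarithms avoids the numerical underflow that a direct product over tens of thousands of choices would cause.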



If $\theta_i$, $i = 1, 2, \ldots, N$, are valid models for a given network 
then $\theta = \sum_{i=1}^{N} \beta_i \theta_i$ with $\beta_i \in [0, 1]$ and $\sum_{i=1}^{N} \beta_i = 1$ is 
also a valid model. This allows sub models to be linearly 
combined to form hybrid models. The linear $\beta$ parameters 
and other model parameters (for example the $\delta$ in the PFP 
model) can be optimised to find the model which has the 
highest $c_0$ value for a given target network. Optimisation of 
the $\beta$ parameters can be performed using generalised linear 
modelling as described in [5]. 
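The claim that a weighted combination of valid models is itself valid can be checked numerically. The sketch below is illustrative only: the function name and the representation of each sub-model as a list of per-node probabilities are assumptions, and it performs no optimisation of the weights.

```python
def mix_models(weights, prob_vectors, tol=1e-9):
    """Combine sub-model probability vectors with mixture weights.

    weights      -- one weight per sub-model, each in [0, 1], summing to 1
    prob_vectors -- one list of node probabilities per sub-model, all over
                    the same nodes in the same order
    """
    if len(weights) != len(prob_vectors):
        raise ValueError("need exactly one weight per sub-model")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    n_nodes = len(prob_vectors[0])
    if any(len(pv) != n_nodes for pv in prob_vectors):
        raise ValueError("sub-models must cover the same nodes")
    return [
        sum(w * pv[j] for w, pv in zip(weights, prob_vectors))
        for j in range(n_nodes)
    ]
```

Because each sub-model sums to one over the nodes and the weights sum to one, the mixed vector also sums to one, which is the validity condition stated above.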

Let $d_i$ be the degree of node $i$ and $T_i$ be the triangle 
count (the number of triangles, or 3-cycles, the node is in). 
The model components considered for this paper included 
the following: $\theta_0$ - the null model (random model) assumes 
all nodes have equal probability $p_i = k_0$; $\theta_d$ - the degree 
model (preferential attachment) assumes node probability 
$p_i = k_d d_i$; $\theta_t$ - the triangle model assumes node probability 
$p_i = k_t T_i$; $\theta_S$ - the singleton model assumes node probability 
$p_i = k_S$ if $d_i = 1$ and $p_i = 0$ otherwise; $\theta_D$ - the doubleton 
model assumes node probability $p_i = k_D$ if $d_i = 2$ and $p_i = 0$ 
otherwise; $\theta_R(n)$ - the "recent" model where $p_i = k_R$ if 
node $i$ was one of those selected in the last $n$ selections and 
$p_i = 0$ otherwise; and $\theta_P(\delta)$ - the PFP model assumes node 
probability $p_i = k_P d_i^{1 + \delta \log_{10}(d_i)}$. The $k$ values are all 
normalising constants to ensure $\sum_i p_i = 1$. 
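The components listed above can be sketched as functions that turn a list of node degrees into a normalised probability vector. This is a hedged illustration rather than the FETA implementation: the function names are hypothetical, the triangle model is omitted because it needs the full graph, and the PFP exponent $1 + \delta \log_{10}(d_i)$ is taken from the usual statement of the PFP model.

```python
import math


def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total == 0:
        raise ValueError("model gives zero weight to every node")
    return [w / total for w in weights]


def null_model(degrees):
    return normalise([1.0] * len(degrees))


def degree_model(degrees):
    return normalise([float(d) for d in degrees])


def singleton_model(degrees):
    return normalise([1.0 if d == 1 else 0.0 for d in degrees])


def doubleton_model(degrees):
    return normalise([1.0 if d == 2 else 0.0 for d in degrees])


def pfp_model(degrees, delta):
    # Weight d ** (1 + delta * log10(d)); degree-zero nodes get no weight.
    return normalise(
        [d ** (1.0 + delta * math.log10(d)) if d > 0 else 0.0 for d in degrees]
    )


def recent_model(n_nodes, past_choices, n):
    """Equal weight on nodes chosen in the last n selections, zero elsewhere."""
    if n < 1:
        raise ValueError("n must be at least 1")
    recent = set(past_choices[-n:])
    return normalise([1.0 if i in recent else 0.0 for i in range(n_nodes)])
```

With `delta = 0` the PFP weights reduce to the plain degree model, which is a convenient sanity check on any implementation.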

So, for example, $\theta = 0.5\theta_d + 0.4\theta_P(0.05) + 0.1\theta_S$ is a model 
which is 50% preferential attachment, 40% PFP with $\delta = 
0.05$ and 10% singleton model. 

3. REAL DATA TESTING 

The FETA procedure is used to create inner models for 
several networks of interest. Section 3.1 fits models to a 
co-authorship network inferred from the arXiv database. 
Section 3.2 fits models to a view of the AS network topology 
referred to here as the UCLA AS network and section 3.3 fits 
models to a second view of the AS topology, which we refer 
to here as the RouteViews AS network. Section 3.4 fits 
network evolution models to a network derived from user 
browsing behaviour on a photo sharing site known as "gallery" and 
section 3.5 fits models to a social network derived from the 
popular photo sharing site Flickr. The networks are 
summarised below. 



Network          edges    nodes    edge/node 
arXiv            15,788    9,121   1.73 
UCLA AS          93,957   29,032   3.24 
RouteViews AS    94,993   33,804   2.81 
gallery          50,472   26,958   1.87 
Flickr           98,931   46,557   2.13 



For each data set, three inner models are tried: a random 
model, a pure PFP model (with an optimally tuned 
$\delta$ for connections to new nodes, and another one for 
internal edges) and the best model found by trying all 
combinations of submodels using the generalised linear model fitting 
procedure described in [5] and maximising the per choice 
likelihood ratio $c_0$. Separate inner models are fitted to 
connections from new nodes and connections between existing 
nodes. These models will be called, for convenience, random, 
PFP and best, where best here should be understood 
as the best possible model using combinations of the 
submodels considered rather than being the best possible model 
of the network. Note that this model does not contain the 
interactive growth model from [11] and the results that 
follow should not be taken as a criticism of PFP as described 
in [11]. 

Because the outer model was not the subject of interest 
here the outer model was simply taken to be the actual op- 
eration observed in the real data. In practice this was little 
different from the results obtained from the outer model de- 
rived simply by calculating empirically from the data two 
distributions: 1) the number of inner nodes each new node 
connects to on arrival, 2) the number of inner edges con- 
nected between each new node arrival. The outer model 
behaviour can be drawn from these distributions and the 
results are little changed. 
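One way to read the empirical outer model described above is as a pair of observed frequency tables from which operations are drawn. The sketch below shows the idea for a single table; the function names and the plain-list input are assumptions for illustration, not part of the FETA software.

```python
import random
from collections import Counter


def empirical_distribution(observations):
    """Map each observed value to its relative frequency."""
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}


def draw(distribution, rng):
    """Draw one value with probability equal to its observed frequency."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]
```

For example, `empirical_distribution` could be fed the number of inner nodes each new node connected to on arrival, and `draw` would then supply that number for each simulated arrival.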

For each model, $c_0$ from Definition 3 is measured. Several 
network statistics are then measured for comparison. Simple 
statistics were chosen: $d_1$ is the proportion of nodes which 
have degree one and $d_2$ the proportion of nodes with degree 
two, $\max d$ is the maximum degree of any node and $\overline{d^2}$ is the 
mean square of the node degrees (a measure of variance). 
Note that the mean degree $\overline{d}$ is not a useful measure: it is set 
by the outer model and would be the same for all models. The clustering 
coefficient $\gamma$ is a measure of the proportion of possible 
triangles present in the graph. The assortativity coefficient $r$ 
is positive when nodes attach to nodes of like degree (high 
degree nodes attach to each other) and negative when high 
degree nodes tend to attach to low degree nodes. For full 
definitions of all these quantities see [7]. 
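The degree-based statistics are simple to compute from an edge list. The sketch below is a minimal illustration under the assumption of a simple undirected graph given as a list of node pairs; the function and key names are hypothetical, and the clustering and assortativity coefficients are left out because they need more than the degree sequence.

```python
from collections import Counter


def degree_statistics(edges):
    """Return d1, d2, max d and mean square degree for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if not degree:
        raise ValueError("edge list is empty")
    values = list(degree.values())
    n = len(values)
    return {
        "d1": sum(1 for d in values if d == 1) / n,  # share of degree-1 nodes
        "d2": sum(1 for d in values if d == 2) / n,  # share of degree-2 nodes
        "max_d": max(values),
        "mean_sq_d": sum(d * d for d in values) / n,
    }
```

On a star with three leaves, for instance, three of the four nodes have degree one and the hub has degree three.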

3.1 Fitting the arXiv data set 

A publication co-authorship network was obtained from 
the online academic publication network arXiv. The first 
paper was added in April 1989 and papers are still being 
added to this day. To keep the size manageable, the network 
was produced just from the papers with the category label 
"math". The network is a co-authorship network: an edge is 
added when two authors first write a paper together. The 
author match is on first initial and surname, though it is 
clear this will allow some collisions. One paper was removed 
from analysis. The paper has 60 authors (far more than the 
next largest) which would add a distorting size 60 clique 
(1,732 links). The arXiv network has also been analysed by 
(amongst others) ^ from the perspective of growth rates 
and clique addition. 
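The construction of the co-authorship network can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: papers are assumed to arrive as author lists in time order, author names are assumed to be already reduced to first initial and surname, and the `max_authors` cut-off is a hypothetical stand-in for the manual removal of the single very large paper.

```python
from itertools import combinations


def coauthorship_edges(papers, max_authors=None):
    """Return new co-authorship edges in the order they first appear.

    papers      -- author lists, ordered by submission time
    max_authors -- optional cut-off; papers with more authors are skipped
    """
    seen = set()
    edges = []
    for authors in papers:
        unique = sorted(set(authors))
        if max_authors is not None and len(unique) > max_authors:
            continue
        # Every pair of authors on the paper forms part of a clique.
        for pair in combinations(unique, 2):
            if pair not in seen:
                seen.add(pair)
                edges.append(pair)
    return edges
```

Only the first collaboration between two authors produces an edge, so later papers by the same pair leave the network unchanged.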

Obviously the random model has $c_0 = 1$. The pure PFP 
model has $\delta = -0.17$ and $c_0 = 1.31$. The best model has, 
as the model for connecting to new nodes, 
$0.56\theta_P(-0.29) + 0.28\theta_R(3) + 0.16\theta_S$ (PFP + recent + singleton) and, 
as the model for connecting between existing nodes, 
$0.57\theta_P(-0.03) + 0.39\theta_R(3) + 0.04\theta_S$ (PFP + recent + singleton). Together this 
gives $c_0 = 6.24$. This implies that PFP should be slightly 
better than random and best should be better than both. 

Figure 1 shows the results for the arXiv data. As can be 
seen, for $d_1$ and $\max d$ the results are in the order predicted 
and, for the best model, are a good fit to the real data. For $d_2$ 
random is slightly better than PFP. For $\overline{d^2}$ PFP is a better fit 
than the best model although both are very similar and quite 
close to the real data. For $\gamma$ and $r$ all models are similar and 
a similarly bad fit to the real data. No models have captured 
these second and third order statistics. The obvious reason 
for this is that, uniquely in the arXiv data, nodes are all added 
as cliques. If $n$ authors write a paper together then a clique 
of size $n$ (some nodes in which are already present on the 
network) is added. An obvious improvement to the model 
could be obtained by having "add clique of size $n$" as an outer 
model operation and an inner model which selected which 
node(s) in the clique were already present in the network. 

3.2 UCLA AS data set 

The data set we refer to here as the UCLA AS data set is a 
view of the Internet AS topology seen between January 2004 
and August 2008. It comes from the Internet topology 
collection maintained by Oliveira et al. [10]. These topologies 
are updated daily using data sources such as BGP routing 
tables and updates from RouteViews, RIPE, Abilene and 
LookingGlass servers. Each node and link is annotated with 
the times it was first and last observed during the measure- 
ment period. The AS data set has been analysed by several 
other researchers but few have analysed the data set as it 
grows. ^ uses linear modelling techniques to assess the 
goodness of fit of a preferential attachment model. 

The data is preprocessed by removing all edges and nodes 
which are not seen in the final sixty days of the data, so 
that the final state of the evolution of the network is the AS 
network as it is in August 2008. Edges are introduced into 
the network in the order of their first sighting. If this would 
cause the network to become disconnected, their introduc- 
tion is delayed. Data is available from January 2004 and 
a "warm up" period is given with Go (the starting graph) 
taken to occur slightly after this start date. 
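The preprocessing steps above can be sketched as a filter followed by an ordering pass. The code is a hedged illustration only: the record layout `(first_seen, last_seen, u, v)` with times in days, the inclusive sixty-day boundary, and the rule that an edge is delayed until at least one endpoint is already in the connected network are all assumptions about details the text does not spell out.

```python
def preprocess(records, start_nodes, end_day, window_days=60):
    """Filter stale edges, then order the rest so the graph stays connected.

    records     -- tuples (first_seen, last_seen, u, v), times in days
    start_nodes -- nodes of the starting graph G_0
    Returns (ordered_edges, never_attached_edges).
    """
    # Keep only edges still seen in the final window of the data.
    kept = sorted(
        (first, u, v)
        for first, last, u, v in records
        if last >= end_day - window_days
    )
    connected = set(start_nodes)
    ordered, pending = [], []
    for _first, u, v in kept:
        pending.append((u, v))
        progress = True
        while progress:  # release delayed edges that can now attach
            progress = False
            still_pending = []
            for a, b in pending:
                if a in connected or b in connected:
                    connected.update((a, b))
                    ordered.append((a, b))
                    progress = True
                else:
                    still_pending.append((a, b))
            pending = still_pending
    return ordered, pending
```

Edges that never touch the connected component are returned separately rather than silently dropped, so the caller can decide how to treat them.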

For the UCLA data the best pure PFP model was with $\delta = 
0.0015$ which had $c_0 = 6.326$. The best model was, for the 
model to connect to new nodes, $0.81\theta_P(0.0015) + 0.19\theta_R(1)$ 
and, for the model to connect between existing nodes, $0.75\theta_d + 
0.2\theta_R(1) + 0.05\theta_S$. This model had $c_0 = 11.43$. 

Figure 2 shows the results for the UCLA network. For 
$d_1$, $d_2$, $\max d$ and $\overline{d^2}$ the results are in the expected order 
and for all but $d_2$ are quite close (no model predicts $d_2$ very 
well). For assortativity, PFP is slightly better than best. 
For clustering coefficient no models are correct. 

3.3 RouteViews AS data set 

For the present paper we define the RouteViews AS data 
set as the view of the Internet AS topology from the point 
of view of a single RouteViews data collector. The raw data 
used to construct it comes from the University of Oregon 
Route Views Project. A fuller description can be found in 
g]. The best pure PFP model was $\theta_P(0.005)$ and the best 
model found was $0.81\theta_P(0.014) + 0.17\theta_R(1)$ (PFP + 
"recent") to connect new nodes and $0.71\theta_d + 0.22\theta_R(1) + 
0.07\theta_S$ (preferential attachment + "recent" + singleton) to 
connect edges between existing nodes. The PFP model 
$\theta_P(0.005)$ had $c_0 = 4.81$ and the best model had $c_0 = 8.06$. 
This suggested that best would be better than PFP which 
would be better than random. 

For most statistics, the models were in the order expected 
but for $\gamma$ and $r$ PFP was slightly better than best. For $d_1$ 
and $\max d$ the PFP model was little different to random 
although in the case of $\max d$, random predicted 
unrealistically slow growth. Overall, however, the model order was 
that predicted by the $c_0$ values. 

http://irl.cs.ucla.edu/topology/ 
http://www.ripe.net/db/irr.html/ 
http://abilene.internet2.edu/ 
Figure 1: Results for arXiv network 
Figure 2: Results for UCLA network 



3.4 Fitting the gallery data set 

The website known simply as "gallery" is a photo sharing 
website. To be able to upload pictures and have some control 
over the display of pictures, users have to create an account 
and login. From webserver logs, the path logged in users 
browse as they move across the network can be followed. 
Thus, images become nodes in the networks, and a user 
browsing between two photos creates a link between the two 
nodes that represent them. These links are overlaid for all 
users in order to form the network analysed here. 
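The overlay of browsing paths can be sketched as below. This is an illustration, not the authors' log-processing code: each user's path is assumed to be an already extracted list of image identifiers, paths are processed one after another rather than interleaved by timestamp, and the function name is hypothetical.

```python
def browsing_edges(paths):
    """Overlay per-user browsing paths into one undirected edge list.

    paths -- one list of image identifiers per user, in browsing order
    Returns each distinct link once, in order of first appearance.
    """
    seen = set()
    edges = []
    for path in paths:
        # Consecutive images in a path are linked by the user's click.
        for a, b in zip(path, path[1:]):
            if a == b:
                continue  # reloading the same image adds no link
            key = (a, b) if a <= b else (b, a)
            if key not in seen:
                seen.add(key)
                edges.append(key)
    return edges
```

Storing each link with its endpoints in sorted order keeps the graph simple and undirected, matching the class of graphs considered in this paper.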



http://gallery.future-i.com/ 



The best pure PFP model for the gallery data was with 
$\delta = -0.4$; however, unusually, this model was worse than 
random with $c_0 = 0.8515$. The best model had, for its 
connections to new nodes, $0.57\theta_S + 0.24\theta_d + 0.19\theta_R(3)$ 
(singleton + preferential attachment + "recent") and for its 
connections between existing nodes $0.61\theta_P(-0.05) + 0.39\theta_R(5)$. 
This model had a per choice likelihood ratio $c_0 = 12.93$. 

Figure 3 shows the results for the gallery network. From 
the $c_0$ values we would expect random to actually be slightly 
better than PFP and best to be much better than either. 
This order is followed for $d_1$, $d_2$ and $\overline{d^2}$ and seems to be for 
$\gamma$ although all models are incorrect here. For assortativity 
$r$ PFP is unexpectedly the best model and for $\max d$ it is 
better than random. In all cases apart from $r$ the best model 
is closest to the real data. Again the $c_0$ statistic seems to 
be a good reflection of the closeness of network statistics, 
particularly "first order" statistics. 

Figure 3: Results for gallery network 

3.5 Fitting the Flickr data set 

The Flickr website allows users to associate themselves 
with other users by naming them as Contacts. In [9] the au- 
thors describe how they collected data for the graph made 
by users as they connect to other users. The first 100,000 
links of this network are analysed here. The graph is gen- 
erated by a web-crawling spider so the order of arrival of 
edges is the order in which the spider moves between the 
users rather than the order in which the users made the 
connections. Thus, the evolution dynamics of this network 
will be determined in part by the spidering code. 

The best pure PFP model for the Flickr data was with $\delta = 
0.015$ and this had $c_0 = 28.29$. The best model had as the 
model for new node connections simply $0.99\theta_R(1) + 0.01\theta_d$ 
and for connections between existing nodes the best model 
was $0.52\theta_P(-0.22) + 0.48\theta_R(1)$. This model had the very high 
per choice likelihood ratio of $c_0 = 430.5$. This is because the 
new node model is almost entirely deterministic: new nodes 
follow by browsing from old nodes. Because the network 
was gathered by following a browsing pattern, the "recent" 
component has a high weight, especially in the new node model. 

Figure 4 shows the results for the Flickr network. From 
the $c_0$ values we would expect the best model to be much 
better than the PFP model which is in turn much better than 
the random model. In fact this is not reflected as strongly 
in the statistics as in the previous modelling. For $d_1$, $d_2$ and 
$r$ the statistics are as expected and best is relatively close. 
For $\gamma$, PFP is worse not better than random. For $\max d$ 
and $\overline{d^2}$ no models are good and the order is not that 
predicted: PFP is slightly better than best. This may be due 
to the presence of a single extremely high degree node 
(degree 11,053 when the network has only 46,557 nodes): more 
than ten percent of the links in the network are to this single 
node. 

http://flickr.com/ 

3.6 Discussion of model fitting 

In general the FETA model assessment performed ex- 
tremely well in these tests. The models were fitted solely 
with regard to the likelihood value, without measuring net- 
work statistics in advance. In all cases, we believe an im- 
partial observer would rank the models in the same order 
as the $c_0$ values. FETA was much faster than growing and 
testing many models. A GLM (generalised linear model) 
procedure as described in [5] allows optimisation of linear 
parameters and dozens of potential sub model combinations 
can be tested in the space of an hour or so. Growing artifi- 
cial networks and testing network statistics can take longer 
than this to assess a single model. The submodels used fo- 
cused on first degree node properties (mainly degree) and 
this may explain why $\gamma$ and $r$ were not always well fitted. 

Some common observations can be made about the mod- 
els fitted. PFP and "recent" were the most commonly used 
model components. As expected, PFP models had a 
negative $\delta$ (sublinear growth) when the node might be subject 
to overloading (an author can only author so many papers, 
a Flickr user can only have time for a certain number of 
friends) but positive in systems where no such overloading 
was likely (an AS will become more efficient at adding 
connections as more people add them). The "spidering" nature 
of the Flickr data produced an unusual model for new node 
connections, which were almost always connected from the 
most recently connected node; this makes sense in a 
"crawling" environment. (The likelihood of $\theta_R(1)$ on its own was 
zero since at least once this was not the behaviour observed.) 
The two AS data sets ended up with quite similar models 
which is extremely encouraging as the fitting was done in- 
dependently. 




Figure 4: Results for Flickr network 



4. CONCLUSIONS 

The FETA framework demonstrated in this paper is an 
excellent way to test hypothesised models of network evolu- 
tion if the data set allows this (evolutionary data must be 
available). In the tests here the model likelihood $c_0$ was an 
excellent predictor of how close network statistics would be 
to those same statistics measured on the target network. In 
addition the statistics usually behaved in the same way as 
the network evolved. The framework has proved a useful 
tool for the investigation of five real target networks. 

The model components here did a reasonable but far from 
perfect job of replicating the real model statistics. However, 
the aim of the paper was to show that the framework could 
assess models, not to design perfect models. In this case the 
models' most common failure was failure to replicate 
clustering coefficient and assortativity. This is perhaps inevitable 
as the models were built from components which relied on 
first order statistics. Altering the inner model to include sec- 
ond order statistics or altering the outer model (for example 
to allow addition of cliques) could improve this behaviour. 

Overall though, the FETA framework is an advance in 
assessment of network topology models. It accounts for the 
evolution of the network rather than trying to match a static 
snapshot. It provides a single statistically rigorous likeli- 
hood for a model rather than relying on trying to match a 
large number of possibly correlated statistics. It is computa- 
tionally cheaper than growing an artificial test network and 
measuring statistics to compare with the target network. 

Much remains to be done with FETA to improve it. The 
outer model needs attention next and it seems that a simi- 
lar likelihood procedure would prove successful here. Many 
different sub models can be tried, in particular focussing on 
second and higher order statistics seems important. The au- 
thors welcome collaboration and all software and data used 
here can be found at http://www.richardclegg.org/software/FETA 